library(tidyverse)
library(ggthemes)
library(knitr)
library(broom)
library(stringr)
library(modelr)
library(plotly)
options(digits = 3)
set.seed(1234)
theme_set(theme_minimal())

Linear models are the simplest functional form to understand. They take the generic form
\[y = \beta_0 + \beta_1 x\]
where \(y\) is the outcome of interest, \(x\) is the explanatory or predictor variable, and \(\beta_0\) and \(\beta_1\) are parameters that vary to capture different patterns. In algebraic terms, \(\beta_0\) is the intercept and \(\beta_1\) is the slope of the linear equation. Given the empirical values you have for \(x\) and \(y\), you generate a fitted model by finding the values of the parameters that best fit the data.
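For simple linear regression those best-fitting values have a closed-form solution: the slope is the covariance of \(x\) and \(y\) divided by the variance of \(x\), and the intercept follows from the sample means. A minimal sketch, using the sim1 data from modelr:

```r
library(modelr) # provides the sim1 example data

# Closed-form OLS estimates for y = b0 + b1 * x
b1 <- cov(sim1$x, sim1$y) / var(sim1$x) # slope
b0 <- mean(sim1$y) - b1 * mean(sim1$x)  # intercept
c(intercept = b0, slope = b1)
```

These match the estimates that lm() produces below.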
ggplot(sim1, aes(x, y)) +
geom_point()

This looks like a linear relationship. We could randomly generate parameters for the formula \(y = \beta_0 + \beta_1 x\) to try to explain or predict the relationship between \(x\) and \(y\):
models <- tibble(
a1 = runif(250, -20, 40),
a2 = runif(250, -5, 5)
)
ggplot(sim1, aes(x, y)) +
geom_abline(aes(intercept = a1, slope = a2), data = models, alpha = 1/4) +
geom_point()

But obviously some parameters are better than others. We need a criterion that differentiates good parameter values from bad ones.
One widely used approach is least squares: the fitted parameters are those that minimize the sum of the squared errors across all observations. The errors are simply the differences between the actual values of \(y\) and the predicted values of \(y\) (also known as the residuals).
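To make this concrete, we can score any candidate line by its sum of squared residuals (measure_ss below is a hypothetical helper, not part of any package):

```r
library(modelr) # provides the sim1 example data

# Sum of squared residuals for a candidate (intercept, slope) pair
measure_ss <- function(intercept, slope, data) {
  e <- data$y - (intercept + slope * data$x)
  sum(e^2)
}

measure_ss(7, 1.5, sim1) # a reasonable line
measure_ss(0, 0, sim1)   # a poor line: predicts y = 0 everywhere
```

The least squares solution is the (intercept, slope) pair that makes this quantity as small as possible.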
dist1 <- sim1 %>%
mutate(
dodge = rep(c(-1, 0, 1) / 20, 10),
x1 = x + dodge,
pred = 7 + x1 * 1.5
)
ggplot(dist1, aes(x1, y)) +
geom_abline(intercept = 7, slope = 1.5, color = "grey40") +
geom_point(color = "grey40") +
geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")

To estimate a linear regression model in R, we use the lm() function. lm() takes two arguments: a formula specifying the equation to be estimated (lm() translates y ~ x into \(y = \beta_0 + \beta_1 x\)), and the data frame containing the variables:
sim1_mod <- lm(y ~ x, data = sim1)

We can use the summary() function to examine key model components, including parameter estimates, standard errors, and model goodness-of-fit statistics.
summary(sim1_mod)

##
## Call:
## lm(formula = y ~ x, data = sim1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.147 -1.520 0.133 1.467 4.652
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.221 0.869 4.86 4.1e-05 ***
## x 2.052 0.140 14.65 1.2e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.2 on 28 degrees of freedom
## Multiple R-squared: 0.885, Adjusted R-squared: 0.88
## F-statistic: 215 on 1 and 28 DF, p-value: 1.17e-14
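Since broom is attached, tidy() and glance() can extract the same components as data frames, which is convenient for further manipulation (a quick sketch):

```r
library(broom)  # tidy(), glance()
library(modelr) # sim1 data

sim1_mod <- lm(y ~ x, data = sim1)

tidy(sim1_mod)   # one row per coefficient: estimate, std.error, statistic, p.value
glance(sim1_mod) # one-row model summary: r.squared, sigma, statistic, etc.
```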
The resulting line from this regression model looks like:
dist2 <- sim1 %>%
add_predictions(sim1_mod) %>%
mutate(
dodge = rep(c(-1, 0, 1) / 20, 10),
x1 = x + dodge
)
ggplot(dist2, aes(x1, y)) +
geom_smooth(method = "lm", color = "grey40") +
geom_point(color = "grey40") +
geom_linerange(aes(ymin = y, ymax = pred), color = "#3366FF")

Linear regression assumes the relationship between the predictors and the response is a straight line. If the true relationship is non-linear, then we cannot generate accurate inferences from the model. Consider the relationship between mpg and horsepower in the ISLR::Auto dataset:
library(ISLR)
data("Auto")
ggplot(Auto, aes(mpg, horsepower)) +
geom_point() +
geom_smooth(se = FALSE)

This doesn’t look linear. We could just look at this graph and draw that conclusion (a very basic form of visualization), but what happens once we estimate a model with more than a single predictor? Instead we can use residual plots to identify non-linearity. Recall that the residual is the error for an individual observation, \(e_i = y_i - \hat{y}_i\), or the difference between the actual and predicted values of the outcome of interest. Residual plots graph the relationship between the fitted values and the residuals. Ideally there should be no discernible pattern in the graph; if there is one, it indicates a problem with some aspect of the linear model.
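As a sanity check on the definition, the residuals can be recomputed by hand from the fitted values (a sketch on the sim1 model from earlier):

```r
library(modelr) # sim1 data

mod_res <- lm(y ~ x, data = sim1)

# e_i = y_i - yhat_i: observed minus fitted
e <- sim1$y - predict(mod_res)
all.equal(unname(residuals(mod_res)), unname(e))
```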
# estimate models
mpg_lin <- lm(horsepower ~ mpg, data = Auto)
mpg_quad <- lm(horsepower ~ poly(mpg, 2, raw = TRUE), data = Auto)
Auto %>%
add_predictions(mpg_lin) %>%
add_residuals(mpg_lin) %>%
ggplot(aes(pred, resid)) +
geom_point(alpha = .2) +
geom_smooth(se = FALSE) +
labs(title = "Residual plot for linear fit")

Auto %>%
add_predictions(mpg_quad) %>%
add_residuals(mpg_quad) %>%
ggplot(aes(pred, resid)) +
geom_point(alpha = .2) +
geom_smooth(se = FALSE) +
labs(title = "Residual plot for quadratic fit")

Another assumption of linear regression is that the error terms \(\epsilon_i\) have a constant variance, \(\text{Var}(\epsilon_i) = \sigma^2\). This is called homoscedasticity. Estimates of the standard errors rely on this assumption. If the variances of the error terms are non-constant (i.e. heteroscedastic), our estimates of the standard errors will be inaccurate - either inflated or deflated - leading to incorrect inferences about the statistical significance of predictor variables.
We can uncover homo- or heteroscedasticity through the use of the residual plot. Below is data generated from the process:
\[Y = 2 + 3X + \epsilon\]
where \(\epsilon\) is random error distributed normally \(N(0,1)\).
sim_homo <- tibble(x = runif(1000, 0, 10),
                   y = 2 + 3 * x + rnorm(1000, 0, 1))
sim_homo_mod <- lm(y ~ x, data = sim_homo)
sim_homo %>%
add_predictions(sim_homo_mod) %>%
add_residuals(sim_homo_mod) %>%
ggplot(aes(pred, resid)) +
geom_point(alpha = .2) +
geom_hline(yintercept = 0, linetype = 2) +
geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
labs(title = "Homoscedastic variance of error terms",
x = "Predicted values",
y = "Residuals")

Compare this to a linear model fit to the data generating process:
\[Y = 2 + 3X + \epsilon\]
where \(\epsilon\) is random error distributed normally \(N(0,\frac{X}{2})\). Note that the variance for the error term of each observation \(\epsilon_i\) is not constant, and is itself a function of \(X\).
sim_hetero <- tibble(x = runif(1000, 0, 10),
                     y = 2 + 3 * x + rnorm(1000, 0, (x / 2)))
sim_hetero_mod <- lm(y ~ x, data = sim_hetero)
sim_hetero %>%
add_predictions(sim_hetero_mod) %>%
add_residuals(sim_hetero_mod) %>%
ggplot(aes(pred, resid)) +
geom_point(alpha = .2) +
geom_hline(yintercept = 0, linetype = 2) +
geom_quantile(method = "rqss", lambda = 5, quantiles = c(.05, .95)) +
labs(title = "Heteroscedastic variance of error terms",
x = "Predicted values",
y = "Residuals")

We see a distinct funnel shape in the relationship between the predicted values and the residuals. This happens because the variance of the error term grows with \(X_i\): a model that assumes constant variance misses the actual response \(Y_i\) by increasingly large amounts as \(X_i\) increases.
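A numeric complement to the residual plot is the Breusch-Pagan test from the lmtest package (which the session info shows attached): it tests the null hypothesis of constant error variance, so a small p-value is evidence of heteroscedasticity. A sketch on data simulated from the same heteroscedastic process:

```r
library(lmtest) # bptest()

set.seed(1234)
sim_bp <- data.frame(x = runif(1000, 0, 10))
sim_bp$y <- 2 + 3 * sim_bp$x + rnorm(1000, 0, sim_bp$x / 2)

# Breusch-Pagan test: H0 is homoscedastic errors
bptest(lm(y ~ x, data = sim_bp))
```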
An outlier is an observation for which the predicted value \(\hat{y}_i\) is far from the actual value \(y_i\). Sometimes outliers are simply coding errors in the original dataset, but sometimes they are extreme or unusual values that were not generated by the same data generating process as the remaining dataset. Detecting outliers is the first step to deciding how to handle them (i.e. keep or remove them from the analysis).
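Before turning to visualizations, a quick numeric screen: observations whose standardized residuals exceed roughly 2 in absolute value deserve a closer look (the threshold of 2 is a common rule of thumb, not a hard cutoff). A sketch on simulated data with one value shifted upward:

```r
set.seed(1234)
sim_out <- data.frame(x = rnorm(20))
sim_out$y <- -2 + sim_out$x + rnorm(20)
sim_out$y[15] <- sim_out$y[15] + 10 # inject one extreme value

mod_out <- lm(y ~ x, data = sim_out)

# rstandard() scales each residual by its estimated standard deviation
which(abs(rstandard(mod_out)) > 2)
```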
We can use a few different visualizations to detect outliers. In a bivariate linear regression model, simply plot the variables against one another and draw the best-fit line:
sim <- tibble(x = rnorm(20),
              y = -2 + x + rnorm(20)) %>%
mutate(y = ifelse(row_number(x) == 15, y + 10, y),
outlier = ifelse(row_number(x) == 15, TRUE, FALSE))
ggplot(sim, aes(x, y)) +
geom_point(aes(color = outlier)) +
geom_smooth(method = "lm", se = FALSE) +
geom_smooth(data = filter(sim, outlier == FALSE),
method = "lm", se = FALSE, linetype = 2) +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")

Or we can draw another residual plot (either raw or standardized):
sim_lm <- lm(y ~ x, data = sim)
sim %>%
add_predictions(sim_lm) %>%
add_residuals(sim_lm) %>%
ggplot(aes(pred, resid)) +
geom_point(aes(color = outlier)) +
labs(x = "Fitted values",
y = "Residuals") +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")

augment(sim_lm) %>%
left_join(sim) %>%
ggplot(aes(.fitted, .std.resid)) +
geom_point(aes(color = outlier)) +
labs(x = "Fitted values",
y = "Standardized residuals") +
scale_color_manual(values = c("black", "red")) +
theme(legend.position = "none")

High-leverage points are observations with unusual values of the predictor variables; when such a point also has a large residual, it can strongly affect the coefficient estimates in a linear model. Influence is a function of leverage and discrepancy:
\[\text{Influence} = \text{Leverage} \times \text{Discrepancy}\]
Various statistical measures exist for quantifying leverage and discrepancy. Leverage is frequently defined by the leverage statistic (also known as the hat value):
\[h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n} (x_{i'} - \bar{x})^2}\]
Residuals are commonly used to measure discrepancy (either raw, standardized, or studentized). Cook’s distance (or Cook’s D) combines these two measures to quantify an observation’s influence:
\[D_i = \frac{\tilde{u}_i^2}{K} \times \frac{h_i}{1 - h_i}\]
where \(\tilde{u}_i^2\) is the squared standardized residual, \(K\) is the number of parameters in the model, and \(h_i\) is the hat value.
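To check these formulas, the hat values and Cook’s D can be computed by hand for a bivariate model and compared against R’s built-in hatvalues() and cooks.distance() (a sketch using sim1, where \(K = 2\)):

```r
library(modelr) # sim1 data

mod_h <- lm(y ~ x, data = sim1)
x <- sim1$x

# Hat values from the formula above
h <- 1 / length(x) + (x - mean(x))^2 / sum((x - mean(x))^2)

# Cook's D from the squared standardized residuals and hat values (K = 2)
d <- rstandard(mod_h)^2 / 2 * h / (1 - h)

c(all.equal(unname(hatvalues(mod_h)), h),
  all.equal(unname(cooks.distance(mod_h)), unname(d)))
```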
How do we visualize this? By combining all three measures into a “bubble plot”, we can examine leverage, discrepancy, and influence simultaneously. For example, here are the results of a basic model of the number of federal laws struck down by the U.S. Supreme Court in each Congress, based on the predictors in the model below:
library(haven)
dahl <- read_dta("data/LittleDahl.dta")
dahl_mod <- lm(nulls ~ age + tenure + unified, data = dahl)
dahl_augment <- dahl %>%
mutate(hat = hatvalues(dahl_mod),
student = rstudent(dahl_mod),
cooksd = cooks.distance(dahl_mod))
ggplot(dahl_augment, aes(hat, student)) +
geom_point(aes(size = cooksd), shape = 1) +
geom_text(data = dahl_augment %>%
arrange(-cooksd) %>%
slice(1:10),
aes(label = Congress)) +
scale_size_continuous(range = c(1, 20)) +
labs(x = "Leverage",
y = "Studentized residual") +
theme(legend.position = "none")

The bubble plot lets us assess leverage, discrepancy, and influence at once and quickly identify the most influential observations.
# Mosaic plot of happiness and education
library(productplots)
data("happy")
happy <- happy %>%
na.omit()

# contingency table - raw values
happy %>%
count(happy, sex) %>%
spread(sex, n) %>%
kable()

| happy | male | female |
|---|---|---|
| not too happy | 1904 | 2460 |
| pretty happy | 8760 | 10539 |
| very happy | 4833 | 6327 |
# contingency table - proportions
with(happy, prop.table(table(happy, sex))) %>%
kable()

| happy | male | female |
|---|---|---|
| not too happy | 0.055 | 0.071 |
| pretty happy | 0.252 | 0.303 |
| very happy | 0.139 | 0.182 |
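The proportions above are taken over the whole table. Conditioning within each sex, which is often what a mosaic plot encodes, only requires the margin argument of prop.table(). A sketch using the counts from the first table:

```r
# Counts copied from the contingency table above
tab <- matrix(c(1904, 8760, 4833,
                2460, 10539, 6327),
              ncol = 2,
              dimnames = list(happy = c("not too happy", "pretty happy", "very happy"),
                              sex = c("male", "female")))

# margin = 2: proportions within each column (sex), so each column sums to 1
prop.table(tab, margin = 2)
```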
# mosaic plot using productplots
prodplot(happy, ~ happy + sex)

# add color
prodplot(happy, ~ happy + sex) +
aes(fill = happy)

prodplot(happy, ~ happy + marital) +
aes(fill = happy) +
theme(legend.position = "none")

# using vcd
library(vcd)
mosaic(~ happy + sex, happy)

devtools::session_info()

## setting value
## version R version 3.3.3 (2017-03-06)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/Chicago
## date 2017-04-25
##
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
## backports 1.0.5 2017-01-18 CRAN (R 3.3.2)
## base64enc 0.1-3 2015-07-28 CRAN (R 3.3.0)
## broom * 0.4.2 2017-02-13 CRAN (R 3.3.2)
## codetools 0.2-15 2016-10-05 CRAN (R 3.3.3)
## colorspace 1.3-2 2016-12-14 CRAN (R 3.3.2)
## DBI 0.6 2017-03-09 CRAN (R 3.3.3)
## devtools 1.12.0 2016-06-24 CRAN (R 3.3.0)
## digest 0.6.12 2017-01-27 CRAN (R 3.3.2)
## dplyr * 0.5.0 2016-06-24 CRAN (R 3.3.0)
## evaluate 0.10 2016-10-11 CRAN (R 3.3.0)
## expsmooth * 2.3 2015-04-09 CRAN (R 3.3.0)
## fma * 2.3 2017-02-12 CRAN (R 3.3.2)
## forcats 0.2.0 2017-01-23 CRAN (R 3.3.2)
## forecast * 8.0 2017-02-23 CRAN (R 3.3.2)
## foreign 0.8-67 2016-09-13 CRAN (R 3.3.3)
## fpp * 0.5 2013-03-14 CRAN (R 3.3.0)
## fracdiff 1.4-2 2012-12-02 cran (@1.4-2)
## ggplot2 * 2.2.1 2016-12-30 CRAN (R 3.3.2)
## ggthemes * 3.4.0 2017-02-19 CRAN (R 3.3.2)
## gtable 0.2.0 2016-02-26 CRAN (R 3.3.0)
## haven * 1.0.0 2016-09-23 cran (@1.0.0)
## highr 0.6 2016-05-09 CRAN (R 3.3.0)
## hms 0.3 2016-11-22 CRAN (R 3.3.2)
## htmltools 0.3.5 2016-03-21 CRAN (R 3.3.0)
## htmlwidgets 0.8 2016-11-09 CRAN (R 3.3.1)
## httr 1.2.1 2016-07-03 CRAN (R 3.3.0)
## ISLR * 1.0 2013-06-11 CRAN (R 3.3.0)
## jsonlite 1.3 2017-02-28 CRAN (R 3.3.2)
## knitr * 1.15.1 2016-11-22 cran (@1.15.1)
## lattice 0.20-34 2016-09-06 CRAN (R 3.3.3)
## lazyeval 0.2.0 2016-06-12 CRAN (R 3.3.0)
## lmtest * 0.9-35 2017-02-11 CRAN (R 3.3.2)
## lubridate 1.6.0 2016-09-13 CRAN (R 3.3.0)
## magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
## MASS 7.3-45 2016-04-21 CRAN (R 3.3.0)
## Matrix 1.2-8 2017-01-20 CRAN (R 3.3.3)
## MatrixModels * 0.4-1 2015-08-22 CRAN (R 3.3.0)
## memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
## mnormt 1.5-5 2016-10-15 CRAN (R 3.3.0)
## modelr * 0.1.0 2016-08-31 CRAN (R 3.3.0)
## munsell 0.4.3 2016-02-13 CRAN (R 3.3.0)
## nlme 3.1-131 2017-02-06 CRAN (R 3.3.3)
## nnet 7.3-12 2016-02-02 CRAN (R 3.3.3)
## plotly * 4.5.6 2016-11-12 CRAN (R 3.3.2)
## plyr 1.8.4 2016-06-08 CRAN (R 3.3.0)
## productplots * 0.1.1.9000 2017-04-25 Github (hadley/productplots@391f500)
## psych 1.7.3.21 2017-03-22 CRAN (R 3.3.2)
## purrr * 0.2.2 2016-06-18 CRAN (R 3.3.0)
## quadprog 1.5-5 2013-04-17 cran (@1.5-5)
## quantreg * 5.29 2016-09-04 CRAN (R 3.3.0)
## R6 2.2.0 2016-10-05 CRAN (R 3.3.0)
## Rcpp 0.12.10 2017-03-19 cran (@0.12.10)
## readr * 1.1.0 2017-03-22 cran (@1.1.0)
## readxl 0.1.1 2016-03-28 CRAN (R 3.3.0)
## reshape2 1.4.2 2016-10-22 CRAN (R 3.3.0)
## rmarkdown 1.3 2016-12-21 CRAN (R 3.3.2)
## rprojroot 1.2 2017-01-16 CRAN (R 3.3.2)
## rvest 0.3.2 2016-06-17 CRAN (R 3.3.0)
## scales 0.4.1 2016-11-09 CRAN (R 3.3.1)
## SparseM * 1.74 2016-11-10 CRAN (R 3.3.2)
## stringi 1.1.2 2016-10-01 CRAN (R 3.3.0)
## stringr * 1.2.0 2017-02-18 CRAN (R 3.3.2)
## tibble * 1.2 2016-08-26 cran (@1.2)
## tidyr * 0.6.1 2017-01-10 CRAN (R 3.3.2)
## tidyverse * 1.1.1 2017-01-27 CRAN (R 3.3.2)
## timeDate 3012.100 2015-01-23 cran (@3012.10)
## tseries * 0.10-38 2017-02-20 CRAN (R 3.3.2)
## vcd * 1.4-3 2016-09-17 CRAN (R 3.3.0)
## viridisLite 0.1.3 2016-03-12 cran (@0.1.3)
## withr 1.0.2 2016-06-20 CRAN (R 3.3.0)
## xml2 1.1.1 2017-01-24 CRAN (R 3.3.2)
## yaml 2.1.14 2016-11-12 cran (@2.1.14)
## zoo * 1.7-14 2016-12-16 CRAN (R 3.3.2)